Project 1

Relationship between the number of COVID-19 deaths and the confirmed cases as of last week, by country and WHO Region

Introduction

An HTML version of this notebook is available at [https://drive.google.com/drive/folders/1-jkQBXFr9dOV5RLfjMmz7rl6BwK0B4PP?usp=sharing].

Coronavirus (COVID-19) is a disease that can spread from person to person through close contact and sometimes by airborne transmission (CDC, 2020b). Older adults and people with medical conditions are more likely to get infected and become severely ill from COVID-19 (CDC, 2020a). It has been a year since WHO declared the COVID-19 outbreak a Public Health Emergency on 30th Jan. 2020 (WHO, 2020). As of today, 3rd Feb. 2021, COVID-19 has taken the lives of 2,237,636 people worldwide and there have been over 100 million confirmed cases (WHO, 2021). The average time from hospitalization to a severe condition is about 7 days (Wang et al., 2020).

This report aims to find the relationship between the total number of COVID-19 Deaths and the total number of Confirmed last week cases of each country, together with its WHO Region. It uses the country-wise dataset of COVID-19 cases as of 27th Jul. 2020, published by 'imdevskp' on GitHub. Confirmed last week is each country's cumulative number of confirmed cases up to 20th Jul. 2020. WHO Region reflects location and population mobility, which can affect the spread of COVID and eventually the death tolls of the countries in the region. The relationship between Deaths and Confirmed last week can help to understand the mortality risk of COVID-19 in each country and to respond with appropriate measures. If a country has a higher number of Confirmed last week cases, more patients are expected to develop severe symptoms in the following week, so the country is expected to report more Deaths. If a country is in a WHO Region with higher average confirmed cases, more Deaths are expected in that country as well.

This project helps not only to raise public awareness of the severity and spread of COVID, but also to uncover the relationship between each country's deaths, its confirmed cases and other factors. The cumulative number of Deaths in each country is the dependent variable. The cumulative number of Confirmed last week cases and the WHO Region of each country are the main independent variables, and other factors will be introduced for further analysis. Besides Deaths, Confirmed last week and WHO Region, the dataset also contains cumulative cases (Confirmed, Deaths, Recovered), daily updates (Active cases, New cases, New recovered, New deaths) and rates (Deaths / 100 Cases, Recovered / 100 Cases, Deaths / 100 Recovered, 1 week % increase) for each country. Moreover, the project intends to introduce other country-level characteristics, such as total population (total_pop), to better explain the relationship.

Importing and Cleaning Data

The pandas, numpy and datetime packages are imported for the data analysis. The DataFrame covid contains 187 entries in total.

The covid DataFrame:

covid.describe() shows the summary statistics of the DataFrame. No column contains NaN, so no further cleaning is needed for missing values. Notably, the column Deaths / 100 Recovered contains some infinite values, which cause its mean to be inf and its standard deviation to be NaN. Since Deaths / 100 Recovered is the cumulative number of deaths divided by the cumulative number of recovered, a country with no recorded recoveries produces a division by zero. Thus, the inf values should be marked as NaN so that they do not affect the mean and standard deviation of Deaths / 100 Recovered.
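The inf handling described above can be sketched as follows. This is a minimal illustration on a made-up three-row frame, not the notebook's actual code: only the column names follow the dataset, while `toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows (not taken from the covid dataset).
toy = pd.DataFrame({
    "Deaths": [10, 5, 0],
    "Recovered": [100, 0, 50],
})

# A country with deaths but zero recoveries produces inf in the ratio.
ratio_col = "Deaths / 100 Recovered"
toy[ratio_col] = 100 * toy["Deaths"] / toy["Recovered"]

# While the inf value is present, the mean of the column is inf.
raw_mean = toy[ratio_col].mean()

# Mark inf as NaN so that pandas skips it in the mean and standard deviation.
cleaned = toy[ratio_col].replace([np.inf, -np.inf], np.nan)
clean_mean = cleaned.mean()
clean_std = cleaned.std()

print(raw_mean, clean_mean, clean_std)
```

After the replacement, the summary statistics are computed over the countries with at least one recovery only, which is worth stating wherever those statistics are reported.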

The summary table indicates that the confirmed, recovered and active cases vary widely between countries, as their standard deviations are large. The median of the total confirmed cases is about 5,000 and 75% of the countries keep their total confirmed cases below 40,000, while the average total cases per country up to July 27th is about 90,000. This shows that a few countries have a very large number of confirmed cases, which pulls the mean far above the median. The same wide gap between mean and median also appears in active and recovered cases. The death cases have a smaller standard deviation than the confirmed cases. However, the mean is still far above the median, and even above the 75th percentile. I investigate the distributions further in the analysis.

Summary statistics of the numerical variables

This section focuses on the relationship between the independent variable Confirmed last week and the dependent variable Deaths. Using .describe(), we can see the summary statistics of the two numerical variables, and we plot a histogram for each.

Analysis on the total death cases

From the summary statistics, the mean death toll for a country is 3,497 while the median is 108. This indicates that the distribution of total Deaths is right-skewed, which is consistent with the histogram. The death tolls range from 0 to 148,011, which indicates a large variance in death cases between countries. From the histogram, most of the countries kept their death tolls under 1,000. The large difference between the mean and the median is caused by a few outliers. For example, the US, Brazil, Mexico and the UK have death tolls over 40,000, and the US has the highest number of deaths in the world.
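The mean-versus-median argument above can be checked on a small example. The sketch below uses five made-up death counts (hypothetical values, chosen only so that one country is an extreme outlier) to show how a single large value pulls the mean far above the median.

```python
import pandas as pd

# Hypothetical death tolls for five countries; the last one is an outlier.
toy_deaths = pd.Series([0, 20, 108, 300, 150000])

mean_deaths = toy_deaths.mean()      # dominated by the outlier
median_deaths = toy_deaths.median()  # unaffected by how large the outlier is
skewness = toy_deaths.skew()         # positive for a right-skewed sample

print(mean_deaths, median_deaths, skewness > 0)
```

A mean well above the median together with a positive sample skewness is the same pattern that the summary statistics show for Deaths.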

Analysis on the total confirmed cases until last week

From the summary statistics, the mean of the total confirmed cases up to last week for a country is 78,682 while the median is 5,020. This indicates that the distribution of Confirmed last week is right-skewed as well, which is consistent with the histogram. The confirmed cases range from 10 to 3,834,677, and the standard deviation between countries is large at 338,273. From the histogram, most of the countries kept their confirmed cases under 10,000. The large difference between the mean and the median of confirmed cases is again caused by a few outliers. For example, the US, Brazil, India and Russia had over 500,000 confirmed cases by 20th Jul. 2020, and the US also has the highest number of confirmed cases in the world.

Correlation between the numerical variables

Examining the correlation table of the numerical variables in the covid dataset, we notice that the cumulative number of Deaths is positively correlated with most of the variables, and only slightly negatively correlated with Recovered / 100 Cases and 1 week % increase. Deaths is strongly correlated with Confirmed, Confirmed last week, Active, Recovered, New deaths, New cases and 1 week change, and weakly correlated with the death rates.
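A correlation table like the one described above can be produced along these lines. This is a sketch on four made-up rows, assuming Pearson correlation (the pandas default); the column names follow the dataset, but `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the covid dataset.
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Confirmed last week": [100, 1000, 5000, 20000],
    "Deaths": [2, 30, 90, 700],
    "Recovered / 100 Cases": [90.0, 70.0, 80.0, 40.0],
})

# Keep only numerical columns, then compute pairwise Pearson correlations.
corr_table = toy.select_dtypes("number").corr()

# Correlation of every other variable with Deaths, strongest first.
deaths_corr = corr_table["Deaths"].drop("Deaths").sort_values(ascending=False)
print(deaths_corr)
```

Sorting the Deaths column of the table makes it easy to read off which variables are strongly, weakly or negatively correlated with the dependent variable.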

Summary statistics of the categorical independent variable

This section focuses on the relationship between the categorical independent variable WHO Region and the dependent variable Deaths. Using .describe() and .groupby(), we can see the summary statistics of Deaths in each region and plot the distribution for each.

Analysis on the total death cases for each WHO Region

When the death cases are grouped by WHO Region, Europe has the largest number of countries in the sample while South-East Asia has the smallest.

Grouping by WHO Region adds a location effect to the analysis of death cases. The summary statistics table indicates that the mean death toll varies between regions, from 9,792 in the Americas to 254 in Africa. In the Americas, the median death toll per country is only 115, ranked 3rd among the 6 regions, while its maximum in one country, the US, is the highest of all 6 regions at 148,011. From the boxplot, the Deaths distributions of all 6 regions are right-skewed. The spread of Deaths varies between regions. The Deaths distributions in South-East Asia and the Americas are more variable than those in other regions. The Americas has the farthest outlier among all 6 regions, which is the total death toll of the US.

Correlation between Deaths and Confirmed last week grouped by WHO Region

When the dataset is grouped into the 6 WHO Regions, the relationship between Deaths and Confirmed last week is still positive in each group. The relationship between the two variables is stronger in Africa, the Americas and South-East Asia, as the absolute values of their correlations are above the correlation without grouping. This implies that, in these three regions, countries with a higher total number of Confirmed last week cases are more likely to have a higher total number of Deaths one week later.
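The per-region comparison above can be sketched as follows, assuming Pearson correlation. The two regions and all values in `toy` are hypothetical; only the column names follow the dataset.

```python
import pandas as pd

# Hypothetical rows for two regions (values are made up).
toy = pd.DataFrame({
    "WHO Region": ["Africa"] * 3 + ["Europe"] * 3,
    "Confirmed last week": [100, 200, 300, 100, 200, 300],
    "Deaths": [1, 2, 3, 5, 3, 6],
})

# Correlation over all countries, ignoring the region.
overall_corr = toy["Deaths"].corr(toy["Confirmed last week"])

# Correlation between the same two columns within each WHO Region.
by_region = (
    toy.groupby("WHO Region")[["Deaths", "Confirmed last week"]]
    .apply(lambda g: g["Deaths"].corr(g["Confirmed last week"]))
)

# Regions whose correlation is stronger than the ungrouped one.
stronger = by_region[by_region.abs() > abs(overall_corr)]
print(overall_corr, by_region, stronger, sep="\n")
```

With only a handful of countries in some regions, a within-region correlation can be high by chance, so the group sizes are worth reporting next to the coefficients.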

Further steps

The other variables in the covid dataset that have a strong correlation with total Deaths should be considered for further analysis. For example, the total Confirmed cases are likely to correlate with total Deaths.

I intend to import country-level characteristics such as total population, land size and the population aged over 65 from a credible source such as the World Bank, to help explain the relationship of Deaths with Confirmed last week and other factors. With the total population of each country, I can take the next step and examine the relationship between population and Deaths. The aged population can also be a factor that influences the Deaths of each country. The location (latitude and longitude) of each country is also valuable information for plotting the data in a more visual way.

Citations

CDC. (2020a, February 11). COVID-19 and Your Health. Centers for Disease Control and Prevention. https://www.cdc.gov/coronavirus/2019-ncov/need-extra-precautions/index.html

CDC. (2020b, October 28). How Coronavirus Spreads. Centers for Disease Control and Prevention. https://www.cdc.gov/coronavirus/2019-ncov/prevent-getting-sick/how-covid-spreads.html

Wang, F., Qu, M., Zhou, X., Zhao, K., Lai, C., Tang, Q., Xian, W., Chen, R., Li, X., Li, Z., He, Q., & Liu, L. (2020, July 3). The timeline and risk factors of clinical progression of COVID-19 in Shenzhen, China. Journal of Translational Medicine. https://translational-medicine.biomedcentral.com/articles/10.1186/s12967-020-02423-8

WHO. (2020, December 23). A year without precedent: WHO’s COVID-19 response. World Health Organization. https://www.who.int/news-room/spotlight/a-year-without-precedent-who-s-covid-19-response

WHO. (2021, February 3). WHO Coronavirus Disease (COVID-19) Dashboard. World Health Organization. https://covid19.who.int

Project 2

THE MESSAGE

Continuing my analysis from Project 1, Project 2 aims to visualize the relationship between Deaths and Confirmed last week for the worldwide spread of COVID-19. The first part of Project 2 analyzes the correlation between the dependent variable Deaths and the independent variable Confirmed last week, and introduces WHO Region and pop_cut to divide the data into subgroups. Within the subgroups, I fit regression lines to see any difference in their correlations. The second part of Project 2 maps the independent and dependent variables: Confirmed last week, total population in 2019 and Deaths. The interactive maps present the comparisons and findings in an intuitive way.

We expect to see a positive relationship between cumulative deaths and confirmed cases: the more confirmed cases a country has, the more likely it is to have a higher death toll. Within each WHO Region, deaths and confirmed cases are expected to be positively correlated as well. A country in the high population division is more likely to have both more confirmed cases and more deaths.

An HTML version of this notebook is available at [https://drive.google.com/drive/folders/1-jkQBXFr9dOV5RLfjMmz7rl6BwK0B4PP?usp=sharing].

Visualization

Based on the current covid dataset, I want to visualize the relationship between cumulative deaths and cases, which helps to estimate the ratio between deaths and confirmed cases in each country. I am also interested in whether the total population of a country has an effect on this relationship.

First, I imported and cleaned the 2019 population data for each country, total_pop, from the World Bank. Some problems arose when merging total_pop with covid. For example, some country names have extra strings or different abbreviations. To solve this, I used a new package called pycountry_convert to create a country_code column in the covid dataset. I merged total_pop with covid on the condition that either the country names or the country codes match. I classified the population of each country into a categorical variable, pop_cut, based on the approximate quartiles of the total_pop distribution. A population above 30 million is classified as high, between 9.5 million and 30 million as higher middle, between 2.3 million and 9.5 million as low middle, and the rest as low. The merged dataset is stored in covid2 and contains 184 observations.
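The population classification above can be sketched with pandas.cut. The thresholds and labels come from the text; the toy populations are hypothetical, and treating each upper threshold as inclusive (the pandas default, right=True) is an assumption, since the text does not say which side of a boundary a country falls on.

```python
import pandas as pd

# Hypothetical populations, one per intended category.
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "total_pop": [1_000_000, 5_000_000, 20_000_000, 300_000_000],
})

# Thresholds from the text: 2.3 million, 9.5 million and 30 million.
bins = [0, 2.3e6, 9.5e6, 30e6, float("inf")]
labels = ["low", "low middle", "higher middle", "high"]

# Assumption: intervals are closed on the right, e.g. (2.3M, 9.5M].
toy["pop_cut"] = pd.cut(toy["total_pop"], bins=bins, labels=labels)
print(toy)
```

Because the labels are passed in increasing order, pop_cut becomes an ordered categorical, which keeps legends and group summaries sorted from low to high.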

Scatter plot for Deaths and Confirmed last week by WHO Region

I chose a scatter plot to see the relationship between Deaths and Confirmed last week by WHO Region. The pandemic spread across countries at a fast rate and the cumulative cases grew exponentially, so I took the log of both Deaths and Confirmed last week to deal with the skewness caused by the very large values of Confirmed last week in some countries. The log-log plot shows whether there is a power law relationship between the two variables.
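A power law Deaths = c * Confirmed^k becomes a straight line with slope k on a log-log plot, which can be checked by fitting a line to the logged values. The sketch below uses synthetic data generated from a known power law (c = 0.03 and k = 0.9 are made-up values), and it drops zero counts before taking logs. How the original notebook treated countries with zero deaths is not stated, so that step is an assumption.

```python
import numpy as np

# Synthetic data from a known power law (hypothetical c and k).
confirmed = np.array([50.0, 1e2, 1e3, 1e4, 1e5])
deaths = 0.03 * confirmed ** 0.9
deaths[0] = 0.0  # a country with no deaths: log10(0) is undefined

# Assumption: rows with zero counts are excluded before taking logs.
mask = (confirmed > 0) & (deaths > 0)
log_confirmed = np.log10(confirmed[mask])
log_deaths = np.log10(deaths[mask])

# Straight-line fit in log-log space: slope = exponent k, intercept = log10(c).
slope, intercept = np.polyfit(log_confirmed, log_deaths, 1)
prefactor = 10 ** intercept
print(slope, prefactor)
```

A fitted slope close to 1 would mean that deaths scale roughly in proportion to confirmed cases, in which case the prefactor can be read as an overall deaths-to-cases ratio.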

The log-log plot shows that the data follow a straight line, which indicates a power law relationship between Confirmed last week and Deaths. The positive relationship means that a country with a higher death toll usually has more confirmed cases, and a country with more confirmed cases is more likely to have a higher death toll. Moreover, I added several reference lines to show the likelihood of death relative to the confirmed cases as of July 27, 2020. The likelihood of death from COVID in most countries is below 5%. The reference lines confirm that in the Americas and Europe the mortality risks of most countries cluster in the range of 2% to 10%. In the Eastern Mediterranean, the range of risks between countries is wider, from 0.25% to 20%. In Africa, the risks are noticeably below 5% and the numbers of deaths and confirmed cases are lower. For further analysis, I divide the data into subplots by WHO Region.
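Each reference line corresponds to a fixed deaths-to-cases ratio r, that is Deaths = r * Confirmed, which appears on a log-log plot as a line of slope 1 shifted by log(r). The sketch below computes such lines and the observed ratio per country. The rates are the percentages mentioned in the text (the exact set of lines drawn in the notebook is not known), the rows in `toy` are hypothetical, and using Confirmed last week as the denominator is an assumption.

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the covid dataset.
toy = pd.DataFrame({
    "Confirmed last week": [1000, 20000, 500],
    "Deaths": [20, 2400, 1],
})

# Percentages mentioned in the text, expressed as fractions.
reference_rates = [0.0025, 0.02, 0.05, 0.10, 0.20]

# One reference line per rate: the deaths implied by that rate at each x value.
reference_lines = {
    rate: rate * toy["Confirmed last week"] for rate in reference_rates
}

# Observed deaths-to-cases ratio and whether it lies below the 5% line.
toy["death_ratio"] = toy["Deaths"] / toy["Confirmed last week"]
toy["below_5_percent"] = toy["death_ratio"] < 0.05
print(toy)
```

A point lying between two reference lines has an observed ratio between the two corresponding rates, which is how the regional ranges quoted above can be read off the plot.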

Subplots for Deaths and Confirmed last week by WHO Region

The faceted regression plot on the log scale shows a positive correlation between cumulative confirmed cases and deaths in all 6 regions. However, the y-axis range differs between subplots, which suggests that the mortality risk varies between regions. The Americas subplot has the highest number of deaths on its y-axis. This suggests that someone who tested positive in the week of July 20th, 2020 had the highest chance of dying from COVID in the following week in the Americas, followed by Europe. However, the plot also indicates that the regions covering Africa and Asia have lower risks of death, even though more than 60% of the world population lives on these two continents. For further analysis, I take the total population of each country into account in addition to WHO Region.

Scatter plot for Deaths and Confirmed last week by WHO Region and population of a country, pop_cut

The color legend of each subplot shows that countries classified as high population tend to have more deaths and confirmed cases in all 6 regions. In the Americas, the population distribution matches the deaths closely: the larger the population of a country, the more COVID Deaths it reports. In other regions, some countries with high population have fewer COVID deaths than countries with lower population. In Africa, the Eastern Mediterranean and South-East Asia, some countries with higher middle population have fewer deaths than countries with low population. In Africa especially, some countries with high and higher middle population report small numbers of confirmed cases and deaths. This may reflect a limitation of the reported cumulative deaths and confirmed cases, as the reported numbers depend on a country's testing capacity and other complex factors. For example, deaths of people with pre-existing diseases, who are more vulnerable during the pandemic, may not be counted as COVID deaths.

Mapping

I first imported the packages I need and the world map data from Natural Earth, which outlines the countries. I noticed that some country names in the covid dataset do not match the country names in the map data, so I first cleaned the names and created a new column of country codes to prepare for merging with the map. The worldcovid dataset was created by matching on either the country names or the country codes. For the range of the color legend, I chose to take the log of the variables I want to display, since the cumulative cases increased exponentially and population varies a lot between countries.
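Matching on "either the country names or the country codes" can be sketched as a lookup by code with a fallback to the name. This is one possible way to implement it, not necessarily what the notebook did. All frames, column names (`country_name`, `country_code`, `map_id`) and values below are hypothetical stand-ins for the covid and map tables, and plain pandas is used instead of a geographic data frame to keep the example self-contained.

```python
import pandas as pd

# Hypothetical stand-in for the covid table.
covid_toy = pd.DataFrame({
    "country_name": ["US", "Burma", "Atlantis"],
    "country_code": ["USA", "BUR", "ATL"],
})

# Hypothetical stand-in for the map table, with an id per map shape.
world_toy = pd.DataFrame({
    "country_name": ["United States of America", "Burma"],
    "country_code": ["USA", "MMR"],
    "map_id": [10, 20],
})

# Two lookup tables, one keyed by code and one keyed by name.
by_code = world_toy.set_index("country_code")["map_id"]
by_name = world_toy.set_index("country_name")["map_id"]

# Try the code first, then fall back to the name where the code did not match.
covid_toy["map_id"] = (
    covid_toy["country_code"].map(by_code)
    .fillna(covid_toy["country_name"].map(by_name))
)

# Rows that matched on neither key are dropped from the merged result.
matched = covid_toy.dropna(subset=["map_id"])
print(matched)
```

Listing the rows that match on neither key is a quick way to find the names that still need manual cleaning before the merge.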

The interactive graphs below are provided in HTML format on Google Drive for submission. Here is the link to the folder [https://drive.google.com/drive/folders/1-jkQBXFr9dOV5RLfjMmz7rl6BwK0B4PP?usp=sharing] containing all the interactive graphs and an HTML version of the complete notebook. I also provide an individual link for each graph.

The map of COVID death cases across the world

Link:[https://drive.google.com/file/d/1w62uMsLW5tyaT3ICaZIUldstVA_GoM3P/view?usp=sharing]

On the interactive graph of cumulative Deaths in log scale, it is easy to observe that the darkest red colors match the countries with the highest numbers of Deaths in the table from Project 1. The United States and Brazil have the darkest colors on the map, which means they had the highest cumulative death tolls up to July 27, 2020, followed by India, Mexico, France and some others. Hovering the mouse over a country shows its detailed information, including Confirmed last week, Deaths, Deaths / 100 Cases, etc.

Meanwhile, the graph shows the geographic WHO Region in which each country is located. The Americas has higher death tolls than any other region, as most of the continent is covered in dark red. Europe is also covered in dark red, which indicates high death tolls for most of its countries. Clearly, there is a location effect on the death cases of each country. If a nearby country has more deaths, it also has more confirmed cases, and it becomes more likely that deaths and confirmed cases in the country itself will increase as COVID spreads.

However, the death cases could be influenced by other factors such as population, the proportion of elderly population, the stringency of social distancing policies in different countries, etc.

The map of total COVID cases confirmed up to last week across the world

Link:[https://drive.google.com/file/d/1RlMitkc_Z3P1iSe-5PVB20KIL7ghL02p/view?usp=sharing]

On the interactive graph of cumulative Confirmed cases up to July 20, 2020 in log scale, it is easy to observe that the darkest colors match the countries with the highest numbers of Confirmed cases in the table from Project 1. Looking for the areas colored in the darkest red, we can see that the United States has more cumulative confirmed cases than any other country. Hovering the mouse over a country shows its detailed information, including Confirmed last week, Deaths, Deaths / 100 Cases, etc.

Meanwhile, the graph shows the geographic WHO Region in which each country is located. Clearly, there is a location effect on the confirmed cases of each country. If a nearby country has more confirmed cases, the country is more likely to have a similarly large number of cumulative confirmed cases as COVID spreads. We can observe that countries with large numbers of cumulative confirmed cases are mainly concentrated in the Americas, South-East Asia and Europe, as those regions are covered in dark red.

However, the confirmed cases could be influenced by other factors such as testing capacity, the stringency of social distancing policies in different countries, population, etc.

The map of population of each country across the world

Link:[https://drive.google.com/file/d/1cwHqVc2ddZTL1K4bcf3oQ6q-8luvU9RO/view?usp=sharing]

This map shows the 2019 population of each country in log scale. The five most populous countries in the world are China, India, the United States, Indonesia and Brazil, which appear as the darkest blue areas on the graph. Comparing this map with the two maps above and looking at each country's hover data, the countries with the largest populations tend to have relatively large cumulative Deaths, Confirmed last week and Confirmed cases. For example, the United States, Brazil and India are among the countries with the largest cumulative reported Deaths and Confirmed last week cases. The countries in yellow-green or blue-green have smaller populations and their cumulative cases are correspondingly smaller.

The previous visualization indicates that some countries in Africa with larger populations reported fewer COVID cases. This is easy to see on the map: for example, Nigeria, Egypt and the Democratic Republic of the Congo have larger populations than some European countries or Canada, but their cumulative cases are much smaller. However, compared with other countries within Africa, their reported cases are indeed larger than those of countries with smaller populations. The exceptions may result from other factors that affect the cumulative cases, such as testing capacity and the stringency of social distancing policies in different countries.

However, there are some exceptions to the argument that countries with larger populations have larger cumulative cases. For example, China has the largest population in the world, yet the hover data on the map show that its cumulative cases are lower than those of many less populous countries. Canada is not among the 20 most populous countries shown in dark blue, but its cumulative Deaths and Confirmed last week cases exceed those of more populous countries such as Japan and China.

Conclusion

In Project 2, I took further visualization steps to demonstrate the relationship between cumulative Deaths and Confirmed last week. The analysis shows that cumulative Confirmed last week is positively correlated with cumulative Deaths, and that the correlation varies between subgroups based on WHO Region and pop_cut. Within each WHO Region, countries with larger populations tend to have more reported cases. The location of a country on the map is also associated with its cumulative cases.

Generally, cumulative Deaths is expected to be higher if a country has high cumulative Confirmed last week cases. The larger the population of a country, the more cumulative Deaths are reported. However, there are exceptions: some countries in Europe with relatively small populations reported more Deaths than other countries with larger populations. In the Africa region, countries reported a smaller number of Deaths even though it has the second largest population.

The relationship between Deaths and Confirmed last week can be interpreted as the likelihood of dying from COVID. The risk varies between countries and WHO Regions. Some limitations of the data can affect the relationship between deaths and confirmed cases, and therefore the estimate of the mortality risk for each country. The reported cases rely on each country's testing criteria and capacity. Some people do not show any symptoms, and people who died with both pre-existing diseases and COVID may not be counted as COVID-19 deaths. The deaths and confirmed cases could also be affected by other factors such as the stringency of policy.

For further analysis, merging other country-level characteristics can help to explain the relationship and would complement the analysis of the mortality risk and of controlling the spread of COVID. It would be interesting to classify the countries by testing rate and stringency of policy to reveal further findings.